
Conversation

@venkywonka
Collaborator

@venkywonka venkywonka commented May 19, 2025

Description

Add llama-3.1-nemotron-ultra-253b-v1 perf test coverage.
This only adds the cpp backend for fp8: a model this large is rarely run in bf16, so fp8 is the only resource-efficient precision to test. The PyTorch backend's fp8 path requires pre-quantized fp8 checkpoints, which we currently don't have added.

Invariants

| Setting | Value |
|---|---|
| GPUs / TP | 8 / 8 |
| Engine dtype | bfloat16 → quant-FP8 |
| max_batch_size | 64 |
| backend | cpp (TRT) |
| benchmarking backend | trtllm-bench |

Four sequence profiles were benchmarked:

  • C1 & C2 (low concurrency): reqs = 8, con = 1
  • C3 & C4 (high concurrency): reqs = 250, con = 250
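For reference, the four profiles can be restated as a small table in code. The dict layout below is just a convenient restatement of the numbers above; it is not the YAML schema actually used in the QA test list.

```python
# Restatement of the four benchmarked sequence profiles from the PR
# description. Key names are descriptive, not the test-list schema.
profiles = {
    "C1": {"input_len": 5000, "output_len": 500,  "requests": 8,   "concurrency": 1},
    "C2": {"input_len": 500,  "output_len": 2000, "requests": 8,   "concurrency": 1},
    "C3": {"input_len": 5000, "output_len": 500,  "requests": 250, "concurrency": 250},
    "C4": {"input_len": 500,  "output_len": 2000, "requests": 250, "concurrency": 250},
}

# Two low-concurrency and two high-concurrency shapes, as stated above.
low = [p for p in profiles.values() if p["concurrency"] == 1]
high = [p for p in profiles.values() if p["concurrency"] == 250]
assert len(low) == 2 and len(high) == 2
```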

Performance Summary

| ID | Con. | Input | Output | Req/s | Output TPS (tok/s) | Avg Latency (ms) | TPS/GPU |
|---|---|---|---|---|---|---|---|
| C1 | 1 | 5000 | 500 | 0.074 | 37.08 | 13,484 | 4.64 |
| C2 | 1 | 500 | 2000 | 0.020 | 40.81 | 49,011 | 5.10 |
| C3 | 250 | 5000 | 500 | 0.453 | 226.29 | 388,037 | 28.29 |
| C4 | 250 | 500 | 2000 | 0.387 | 773.77 | 433,890 | 96.72 |
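The derived columns in the summary are internally consistent: TPS/GPU is Output TPS over the 8 GPUs, and Req/s is approximately Output TPS divided by the output length (a steady-state approximation). A quick sanity check over the table values:

```python
# Sanity-check derived columns of the performance summary.
# Values copied from the table above; 8 GPUs per the invariants.
rows = {
    "C1": {"output_len": 500,  "req_s": 0.074, "tps": 37.08,  "tps_per_gpu": 4.64},
    "C2": {"output_len": 2000, "req_s": 0.020, "tps": 40.81,  "tps_per_gpu": 5.10},
    "C3": {"output_len": 500,  "req_s": 0.453, "tps": 226.29, "tps_per_gpu": 28.29},
    "C4": {"output_len": 2000, "req_s": 0.387, "tps": 773.77, "tps_per_gpu": 96.72},
}
NUM_GPUS = 8
for name, r in rows.items():
    # TPS/GPU should equal Output TPS / 8 (to rounding).
    assert abs(r["tps"] / NUM_GPUS - r["tps_per_gpu"]) < 0.01, name
    # Req/s should approximate Output TPS / output length.
    assert abs(r["tps"] / r["output_len"] - r["req_s"]) < 0.001, name
```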

Latency percentiles

Concurrency = 1 (C1 & C2)

| Shape | P50 | P90 | P95 | P99 | Min | Max |
|---|---|---|---|---|---|---|
| 5k × 500 | 13,620 | 13,751 | 13,751 | 13,751 | 12,798 | 13,751 |
| 500 × 2k | 48,996 | 49,121 | 49,121 | 49,121 | 48,943 | 49,121 |

Concurrency = 250 (C3 & C4)

| Shape | P50 | P90 | P95 | P99 | Min | Max |
|---|---|---|---|---|---|---|
| 5k × 500 | 391,470 | 549,776 | 550,795 | 551,404 | 141,348 | 551,507 |
| 500 × 2k | 375,050 | 645,705 | 645,794 | 645,849 | 182,317 | 645,852 |

@venkywonka venkywonka requested review from LarryXFly, Copilot, kaiyux, ruodil, schetlur-nv and tijyojwad and removed request for Copilot and ruodil May 19, 2025 13:35
@venkywonka venkywonka marked this pull request as ready for review May 19, 2025 13:35
Contributor

Copilot AI left a comment


Pull Request Overview

This PR introduces performance tests for the Llama-3_1-Nemotron-Ultra-253B-v1 model using the cpp TRT backend to ensure that both low- and high-concurrency scenarios pass within CI limits.

  • Added new test entries with appropriate parameters (max batch size, input/output lengths, concurrency, etc.) in the QA test list.
  • Updated model mapping in test_perf.py to include the new ultra model for both native and Hugging Face identifiers, and appended a build flag when remote code is trusted.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
tests/integration/test_lists/qa/trt_llm_release_perf_test.yml Added new performance test entries for Llama-3_1-Nemotron-Ultra-253B-v1.
tests/integration/defs/perf/test_perf.py Added new model mapping entries and introduced a build flag for TRUST_REMOTE_CODE_MODELS.

Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@venkywonka venkywonka requested a review from Copilot May 19, 2025 14:17
Contributor

Copilot AI left a comment


Pull Request Overview

Adds C++-backend FP8 performance tests for the llama-3.1-nemotron-ultra-253b-v1 model and hooks them into the test runner, including enabling remote code trust for quantized builds.

  • New YAML entries in trt_llm_release_perf_test.yml for low/high concurrency FP8 benchmarks
  • Model mapping definitions added in test_perf.py for both C++ and HF backends
  • Auto-enables --trust_remote_code for models listed in TRUST_REMOTE_CODE_MODELS during build
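The auto-enable behavior described above can be sketched roughly as follows. `TRUST_REMOTE_CODE_MODELS` and the `--trust_remote_code` flag are named in this PR, but the surrounding function and variable names below are illustrative assumptions, not the actual `test_perf.py` code.

```python
# Illustrative sketch of the gating described in the review; not the
# actual test_perf.py implementation.
TRUST_REMOTE_CODE_MODELS = {
    # Hypothetical allow-list entries mirroring the mappings in this PR.
    "llama_v3.1_nemotron_ultra_253b",
    "llama_v3.1_nemotron_ultra_253b_hf",
}

def build_extra_args(model_name: str) -> list:
    """Append --trust_remote_code only for explicitly allow-listed models."""
    args = []
    if model_name in TRUST_REMOTE_CODE_MODELS:
        # Per the review's security note: remote code will be executed,
        # so models must be audited before being added to this set.
        args.append("--trust_remote_code")
    return args
```

This keeps remote-code trust opt-in per model rather than enabling it globally, which is the mitigation the review comment asks to document.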

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
tests/integration/test_lists/qa/trt_llm_release_perf_test.yml Added four new perf test invocations covering C1–C4 scenarios for the new Ultra-253B model
tests/integration/defs/perf/test_perf.py Registered llama_v3.1_nemotron_ultra_253b[_hf] mappings and added --trust_remote_code flag
Comments suppressed due to low confidence (3)

tests/integration/defs/perf/test_perf.py:58

  • The repository path uses 'Llama-3_1...' with an underscore instead of the dot notation ('Llama-3.1...'), which is inconsistent with other model paths and may break resolution. Update to match the existing naming convention.
"nemotron-nas/Llama-3_1-Nemotron-Ultra-253B-v1",

tests/integration/defs/perf/test_perf.py:105

  • The HuggingFace model path uses an underscore in 'Llama-3_1...' instead of 'Llama-3.1...'; this diverges from established naming and may cause lookup failures. Please correct it.
"nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",

tests/integration/defs/perf/test_perf.py:932

  • Automatically setting --trust_remote_code=True can pose security risks if unreviewed code is pulled. Ensure these models are audited or document why remote code trust is safe here.
if self._config.model_name in TRUST_REMOTE_CODE_MODELS:

@venkywonka venkywonka force-pushed the user/venky/ll-nemo-ultra-perf-tests-cpp branch from acbc33c to 02e6b02 Compare May 19, 2025 15:11
Signed-off-by: Venky Ganesh <[email protected]>
@venkywonka venkywonka force-pushed the user/venky/ll-nemo-ultra-perf-tests-cpp branch from eb62a33 to 0bb8c75 Compare May 22, 2025 13:39
@venkywonka
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #6157 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #6157 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4502 completed with status: 'SUCCESS'

@schetlur-nv schetlur-nv merged commit c713eb5 into NVIDIA:main May 22, 2025
3 checks passed
venkywonka added a commit to venkywonka/TensorRT-LLM that referenced this pull request May 22, 2025
chzblych pushed a commit that referenced this pull request May 28, 2025
darraghdog pushed a commit to darraghdog/TensorRT-LLM that referenced this pull request Jun 3, 2025